Parallelizing single patch pass clustering
Abstract
Clustering algorithms such as k-means, the self-organizing map (SOM), or Neural Gas (NG) are popular tools for automated information analysis. As data sets grow ever larger, it is vital that these algorithms run efficiently on huge data sets. Here we propose a parallelization of patch neural gas that requires only a single pass over the data set and works with limited memory, which makes it very efficient for streaming or massive data sets. The realization is general enough to be transferred easily to alternative prototype-based methods and distributed settings. A relative speed-up that is approximately linear in the number of processors can be observed.
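To illustrate the single-pass, limited-memory idea in the abstract, here is a minimal Python sketch of patch-wise clustering with a weighted batch neural gas. It is not the authors' implementation, and it does not show the parallelization itself. The function names `weighted_batch_ng` and `patch_ng`, the annealing schedule, and the `epochs` default are assumptions made for illustration. Each patch is clustered together with the previous prototypes, which enter as extra points weighted by the number of data points they have represented so far, so only one patch and k prototypes are held in memory at a time.

```python
import numpy as np


def weighted_batch_ng(points, weights, prototypes, epochs=20):
    """Batch neural gas on weighted points; returns updated prototypes."""
    k = len(prototypes)
    protos = np.array(prototypes, dtype=float)
    for t in range(epochs):
        # Assumed schedule: neighbourhood range decays from k/2 to 0.01.
        lam0 = k / 2.0
        lam = lam0 * (0.01 / lam0) ** (t / max(epochs - 1, 1))
        # Squared distances between points and prototypes, shape (n, k).
        d = ((points[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
        # Rank of every prototype for every point (0 = closest).
        ranks = d.argsort(axis=1).argsort(axis=1)
        h = np.exp(-ranks / lam) * weights[:, None]
        denom = h.sum(axis=0)
        upd = denom > 0  # leave a prototype unchanged if it gets no mass
        protos[upd] = (h.T @ points)[upd] / denom[upd, None]
    return protos


def patch_ng(patches, k, seed=0):
    """Single pass over an iterable of patches, each of shape (n_i, dim)."""
    rng = np.random.default_rng(seed)
    protos, counts = None, None
    for patch in patches:
        patch = np.asarray(patch, dtype=float)
        if protos is None:
            # Initialise from the first patch (needs at least k points).
            protos = patch[rng.choice(len(patch), size=k, replace=False)]
            pts, w = patch, np.ones(len(patch))
        else:
            # Previous prototypes join the patch, weighted by their counts.
            pts = np.vstack([patch, protos])
            w = np.concatenate([np.ones(len(patch)), counts])
        protos = weighted_batch_ng(pts, w, protos)
        # New counts: total weight falling into each receptive field.
        d = ((pts[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
        counts = np.bincount(d.argmin(axis=1), weights=w, minlength=k)
    return protos, counts
```

A call such as `patch_ng(patches, k=2)` accepts any iterable of arrays, including a generator that reads patches from disk or a stream. In a sketch like this, the distance and rank computation inside each patch is the part that would most naturally be split across processors.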
Similar resources
Single pass clustering for large data sets
The presence of very large data sets poses new problems to standard neural clustering and visualisation algorithms such as Neural Gas (NG) and the Self-Organising Map (SOM) due to memory and time constraints. In such situations, it is no longer possible to store all data points in the main memory at once, and only a few runs, ideally only one, over the whole data set are still affordable to achieve ...
Patch Relational Neural Gas - Clustering of Huge Dissimilarity Datasets
Clustering constitutes a ubiquitous problem when dealing with huge data sets for data compression, visualization, or preprocessing. Prototype-based neural methods such as neural gas or the self-organizing map offer an intuitive and fast variant which represents data by means of typical representatives, thereby running in linear time. Recently, an extension of these methods towards relational c...
GenIc: A Single-Pass Generalized Incremental Algorithm for Clustering
In this paper we introduce a new single pass clustering algorithm called GenIc designed with the objective of having low overall cost. We examine some of the properties of GenIc and compare it to windowed k-means. We also study its performance using experimental data sets obtained from network monitoring.
JSweep: A Patch-centric Data-driven Approach for Parallel Sweeps on Large-scale Meshes
In mesh-based numerical simulations, the sweep is an important computation pattern. While sweeping a mesh, computations on cells are strictly ordered by data dependencies in given directions. Because of this serial order, parallelizing a sweep is challenging, especially for unstructured and deforming structured meshes. Meanwhile, recent high-fidelity multi-physics simulations of particle transport, in...
Network Topic Detection Model Based on Text Reconstructions
The single-pass clustering algorithm is widely used in topic detection and tracking. It is a key part of the network topic detection model. When the single-pass algorithm runs, the clustering results are not satisfactory, and the similarity matching is reduced. Focusing on these two defects, this paper physically reconstructs web information into a volume, in which every document contains “theme ...
Publication date: 2008